



## A bit of history

- Cell is short for Cell Broadband Engine Architecture
  - Aka CBEA, or CellBE
- Joint venture of Sony, Toshiba, and IBM
  - First design meetings in 2001
  - Over 400 engineers
  - First chips out in 2005
- Priorities set at design stage:
  - Optimize performance per watt
  - Optimize bandwidth over latency
  - Optimize total throughput over ease of programming




## **Cell based products**

- Sony PlayStation 3
- IBM QS20, QS21, QS22 blades
- Several others planned
  - Toshiba HDTV set
  - Leadtek PCIe graphics accelerator board
  - ..




## Roadrunner: breaking the Petaflop barrier

- First computer to break the Petaflop barrier
- Installed at Los Alamos National Laboratory (LANL)
- Built by IBM for the U.S. Department of Energy National Nuclear Security Administration
  - Aging of nuclear materials (nuclear weapons stock)
- Hybrid architecture
  - 12,960 IBM PowerXCell 8i processors (QS22 blades)
  - 6,480 dual-core AMD Opteron processors
  - Specially designed tri-blade computing node






## **Protein Folding @ Home**

- Distributed computing project managed by Stanford University
  - Protein folding
  - Molecular dynamics
- August 2006: PS3 version launched
- October 2006: GPU version launched
- Large contribution to TFLOPS from only a few clients

| Platform | TFLOPS (share of total) | Clients, NCPU (share of total) |
|----------|-------------------------|--------------------------------|
| NVIDIA   | 2,182 (43%)             | 18,333 (4.6%)                  |
| PS3      | 1,387 (28%)             | 49,180 (12.4%)                 |
| TOTAL    | 5,033                   | 394,853                        |
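
The point that a few clients supply a large share of the throughput can be checked from the table. The sketch below assumes the TFLOPS and NCPU columns are whole-number counts (for example 2,182 TFLOPS from 18,333 NVIDIA clients) and recomputes the shares plus an approximate per-client throughput. The variable and function names are illustrative only.

```python
# Figures copied from the table above, read as whole-number counts (assumption).
stats = {
    "NVIDIA": {"tflops": 2182, "clients": 18333},
    "PS3": {"tflops": 1387, "clients": 49180},
}
total = {"tflops": 5033, "clients": 394853}


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for name, row in stats.items():
    tflops_share = share(row["tflops"], total["tflops"])
    client_share = share(row["clients"], total["clients"])
    # 1 TFLOPS = 1000 GFLOPS; average throughput contributed by one client.
    gflops_per_client = 1000.0 * row["tflops"] / row["clients"]
    print(f"{name}: {tflops_share:.1f}% of TFLOPS from "
          f"{client_share:.1f}% of clients, "
          f"~{gflops_per_client:.0f} GFLOPS per client")
```

Together the two accelerator platforms account for roughly 70% of the throughput while making up well under a fifth of the clients.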



## **Outline**

- Introduction
  - Multicore architectures
- Cell architecture
  - Overview
  - Memory architecture
  - PPE architecture
  - SPE architecture
  - Element Interconnection Bus
  - Cell-based systems
- Programming models
  - Streaming
  - Fork-join
  - Task queue























## **Outline**

- Introduction
  - Multicore architectures
- **Cell architecture**
  - Overview
  - Memory architecture
  - PPE architecture
  - SPE architecture
  - Element Interconnection Bus
  - Cell-based systems
- Programming models
  - Streaming
  - Fork-join
  - Task queue



## Cell Processor: 9 heterogeneous cores

- 1 Master Processor (PPE)
  - 2-way SMT
  - 2-way in-order issue
  - Minimal branch prediction
  - VMX support
- 8 Accelerators (SPE)
  - 2-way in-order issue
  - No hardware branch prediction (software branch "hints" instead)
  - SIMD ISA
  - 256KB local memory
  - DMA controller
- 512KB L2 cache
- 4-ring interconnection network (EIB)
- Integrated memory and I/O controllers
- Simple design due to power/area restrictions





















## EIB performance greatly depends on physical layout





- Experiments run on a prototype QS20 blade @ 2.1 GHz
- 4 SPE transfer data to their "right" or "left" neighbor
  - 4 couples transfer data back + forth
- Vertical lines show variation across 10 different executions




## **EIB Performance greatly suffers from conflicts**





- Experiments run on a prototype QS20 blade @ 2.1 GHz
- All 8 SPEs send data to both their right + left neighbors
  - 2 cycles of 8 SPEs transfer data in each direction
- EIB performance drops by ~30% vs. the couples setup
- Much less variation depending on placement
  - All setups have too many conflicts








## **Outline**

- Introduction
- Cell architecture
  - Overview
  - Memory architecture
  - PPE architecture
  - SPE architecture
  - Element Interconnection Bus
- Programming models
  - Streaming
  - Fork-join
  - Task queue



## **Functional partitioning**

- Exploits control-level parallelism
- Code is partitioned in a dataflow fashion
  - Each step in data processing is split as a separate filter
- Data is transferred directly from one filter to the next
  - Data communication streams
- Optimizes data transfer locality
  - Stream head fetches data from off-chip memory
  - Data moves directly from local memory to local memory
    - No off-chip temporary result storage
  - Stream tail transfers results back to off-chip memory
- Does not scale with data size
  - Parallelism determined by algorithm complexity, not data size
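
As a rough illustration of the dataflow style described above, the sketch below chains a few hypothetical filters using Python generators. It is only an analogy: on Cell each filter would run on its own SPE and pass blocks from local memory to local memory by DMA, whereas here everything runs in a single thread. The filter names (`head`, `scale`, `clamp`, `tail`) and the data are invented for the example.

```python
def head(source, block_size):
    """Stream head: fetch fixed-size blocks from 'off-chip' memory."""
    for start in range(0, len(source), block_size):
        yield source[start:start + block_size]


def scale(blocks, factor):
    """Filter 1: multiply every sample by a constant."""
    for block in blocks:
        yield [x * factor for x in block]


def clamp(blocks, low, high):
    """Filter 2: saturate samples to the range [low, high]."""
    for block in blocks:
        yield [min(max(x, low), high) for x in block]


def tail(blocks):
    """Stream tail: write results back to 'off-chip' memory."""
    result = []
    for block in blocks:
        result.extend(block)
    return result


main_memory = list(range(10))
# Blocks flow filter to filter; no intermediate list is stored in main_memory.
output = tail(clamp(scale(head(main_memory, 4), 3), 0, 20))
print(output)
```

The pipeline has the same number of stages however long `main_memory` is, which is the scaling limit noted above: the available parallelism comes from the algorithm, not from the data size.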







## **Cell Superscalar Programming Model**

- Task-based programming model for the Cell Processor.
  - Parallel computation on PPE and SPEs from sequential code by adding pragmas.
- Pragma annotations applied to functions to define them as tasks.
- Program execution starts on the PPE, which spawns tasks to SPEs on annotated function calls.
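
CellSs itself annotates C functions with pragmas. The sketch below only mimics the idea in Python: a hypothetical `task` decorator stands in for the pragma, so that a call to an annotated function records a task instead of running it inline. None of the names here (`task`, `pending_tasks`, `run_all`) belong to CellSs; they are assumptions made for the illustration.

```python
pending_tasks = []  # tasks recorded by the "master" instead of being run inline


def task(inputs=(), outputs=()):
    """Hypothetical stand-in for a task pragma: mark a function as a task."""
    def decorate(func):
        def spawn(*args):
            # Record the call; a real runtime would ship it to an SPE.
            pending_tasks.append((func, args, inputs, outputs))
        return spawn
    return decorate


@task(inputs=("a", "b"), outputs=("c",))
def vector_add(a, b, c):
    for i in range(len(c)):
        c[i] = a[i] + b[i]


@task(inputs=("c",), outputs=("d",))
def vector_double(c, d):
    for i in range(len(d)):
        d[i] = 2 * c[i]


def run_all():
    """Run recorded tasks in program order (stands in for the SPE workers)."""
    while pending_tasks:
        func, args, _inputs, _outputs = pending_tasks.pop(0)
        func(*args)


a, b = [1, 2, 3], [10, 20, 30]
c, d = [0] * 3, [0] * 3
vector_add(a, b, c)    # looks like a normal call, but only spawns a task
vector_double(c, d)
spawned = len(pending_tasks)
run_all()
print(spawned, c, d)
```

The main program stays sequential in appearance; the declared inputs and outputs are what would let a runtime decide which tasks can safely run in parallel.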




## CellSs task management and execution



- Master thread (PPE 1st SMT slot):
  - Starts main execution.
  - On an annotated function call, creates the task and adds it to the dependence graph.
- Helper thread (PPE 2nd SMT slot):
  - Schedules tasks from the dependence graph.
  - Groups tasks in bundles (default: 8 tasks/bundle).
  - Dispatches task bundles to SPEs.
- Worker threads (SPEs):
  - Execute tasks and notify task finalization to the Helper thread.
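
The division of labour above can be sketched in a few lines. This is a simplified model under stated assumptions, not the CellSs runtime: only read-after-write dependences are tracked, "execution" is just marking a task as finished, and the names `Task`, `add_task`, and `schedule` are hypothetical. The bundle size of 8 is the default quoted above.

```python
BUNDLE_SIZE = 8  # default bundle size quoted above


class Task:
    def __init__(self, name, reads, writes):
        self.name, self.reads, self.writes = name, set(reads), set(writes)
        self.deps = set()  # names of tasks that must finish first


def add_task(graph, last_writer, task):
    """Master thread: insert a task, adding read-after-write dependences."""
    for datum in task.reads:
        if datum in last_writer:
            task.deps.add(last_writer[datum])
    for datum in task.writes:
        last_writer[datum] = task.name
    graph[task.name] = task


def schedule(graph):
    """Helper thread: repeatedly bundle ready tasks and dispatch them."""
    finished, bundles = set(), []
    remaining = dict(graph)
    while remaining:
        ready = [t for t in remaining.values() if t.deps <= finished]
        bundle = ready[:BUNDLE_SIZE]
        bundles.append([t.name for t in bundle])
        for t in bundle:  # worker: "execute" and notify completion
            finished.add(t.name)
            del remaining[t.name]
    return bundles


graph, last_writer = {}, {}
for i in range(10):
    add_task(graph, last_writer, Task(f"produce{i}", [], [f"x{i}"]))
add_task(graph, last_writer,
         Task("reduce", [f"x{i}" for i in range(10)], ["sum"]))
bundles = schedule(graph)
print([len(b) for b in bundles])
```

In this example the ten independent producer tasks are split across two bundles, and the dependent `reduce` task is only dispatched once every producer has reported completion.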







## How to increase the parallelism?

- Max active tasks = $\dfrac{\text{task execution time}}{\text{task generation time}}$
- Increase the task size
  - This would not work for applications with
    - Fixed total problem size
      - Larger tasks → fewer tasks (i.e. less parallelism)
    - Fixed task size
      - E.g. multimedia applications work on fixed-size blocks (H.264, MPEG-4)
  - Task size is also limited by the data that fits in the Local Store
  - NOT A GLOBAL SOLUTION
- Faster task generation
  - Faster PPE: faster processor, better caches, ...
  - Quantitative evaluation is very important: power and area restrictions
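
A small worked example of the ratio above, using made-up timings, shows why both levers matter. The numbers are hypothetical and chosen only for illustration; the count of 8 SPEs comes from the architecture overview.

```python
def max_active_tasks(task_execution_time, task_generation_time):
    """Upper bound on tasks in flight: execution time / generation time."""
    return task_execution_time / task_generation_time


NUM_SPES = 8  # from the architecture overview

# Hypothetical timings in microseconds, chosen only for illustration.
for exec_us, gen_us in [(50.0, 25.0), (200.0, 25.0), (200.0, 10.0)]:
    active = max_active_tasks(exec_us, gen_us)
    busy = min(active, NUM_SPES)
    print(f"exec={exec_us} us, gen={gen_us} us -> "
          f"{active:.0f} active tasks, at most {busy:.0f} of {NUM_SPES} SPEs busy")
```

With these timings, the first configuration can keep only 2 SPEs busy. Making tasks four times longer or generating them faster both raise the bound to 8 or more, which is the minimum needed to occupy every SPE.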




## Out of order execution gets 50% speedup

- All in-order and out-of-order configurations are evaluated.
- Out of order execution improves performance by 50% on average.
  - More functional units provide up to 10% extra improvement.
- This result shows that the task generation phase in CellSs is very sensitive to memory latency on cache misses.






## Local memory for task generation improves out-of-order performance by 20%



- As seen in the previous experiments, the data used by task generation could fit in a 2-4MB local memory.
- Such a large on-chip memory would be slower, so latencies from 8 to 256 cycles are evaluated and compared to the ooo-2 configuration.
- Having a local memory would improve out-of-order performance by close to 20% even with a 64-cycle latency on average.




## Conclusions (I)

- Cell is a 9-processor heterogeneous CMP
  - Multiple ISA
  - Distributed memory
- Programming Cell is difficult
  - Distributed memory programming model
    - Must partition + transfer data to processors
    - Performance heavily depends on good use of the DMA engine
  - DMA programming harder than it should be
    - Lots of implementation-specific restrictions
  - SIMD-only SPU instruction set
    - Automatic vectorization not up to the challenge
    - Poor performance on scalar / control code
      - Even worse since there is no branch prediction
  - VLIW pipeline on the SPU
    - Instruction scheduler not always good enough
  - Poor PPE performance



## Conclusions (II)

- But, there is nothing wrong with the concept!
  - Heterogeneous CMP, multiple ISA
    - Assign each task to the most adequate processor
  - Distributed memory
    - Higher scalability + efficiency
- Lots of room for improvement on the implementation
  - Easier ISA on the SPU side
    - Automatic code generation by the compiler
  - Avoid architecture-driven restrictions
    - Even if they have a cost in performance?
  - Provide at least one high-performance processor
    - Task generation, sequential parts of the code
